DATA SCIENCE SALARY ANALYSIS 2020-2022¶

Date Created March 2023

1. Why this project?¶

To answer this question, I am going to talk a bit about myself. I am a recent graduate in Business Analytics or Data Science in general, and I would like to kick off the process of hunting jobs, therefore, it is wise to find out more about salary for salary expectation and negotiation. Another reason is part of my curiosity as an ex-payoll specialist to understand current compensation trend, how rewarding for a senior position and much more.

All in all, this project aims to:

-Provide historical salary information to new graduates

-Supply salary range between different lelvels to current employees

-Encourage data nerds on their journey by seeing the potential of salary increment.

So, let's dive in!

2. Input¶

I downloaded the dataset from Kaggle for this analysis

In [5]:
#Import libraries

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import plotly.express as px
In [6]:
#Import dataset

df=pd.read_csv(r"C:\Users\huong\OneDrive\Máy tính\4. JOB RELATED MATTERS\2. Python\Project 2 - Portfolio\ds_salaries.csv")
df
Out[6]:
Unnamed: 0 work_year experience_level employment_type job_title salary salary_currency salary_in_usd employee_residence remote_ratio company_location company_size
0 0 2020 MI FT Data Scientist 70000 EUR 79833 DE 0 DE L
1 1 2020 SE FT Machine Learning Scientist 260000 USD 260000 JP 0 JP S
2 2 2020 SE FT Big Data Engineer 85000 GBP 109024 GB 50 GB M
3 3 2020 MI FT Product Data Analyst 20000 USD 20000 HN 0 HN S
4 4 2020 SE FT Machine Learning Engineer 150000 USD 150000 US 50 US L
... ... ... ... ... ... ... ... ... ... ... ... ...
602 602 2022 SE FT Data Engineer 154000 USD 154000 US 100 US M
603 603 2022 SE FT Data Engineer 126000 USD 126000 US 100 US M
604 604 2022 SE FT Data Analyst 129000 USD 129000 US 0 US M
605 605 2022 SE FT Data Analyst 150000 USD 150000 US 100 US M
606 606 2022 MI FT AI Scientist 200000 USD 200000 IN 100 US L

607 rows × 12 columns

In [7]:
#Have a glance

df.shape
Out[7]:
(607, 12)

Data cleaning¶

  1. Drop unused variables
  2. Check missing values (if any)
  3. Deal with outliers (if any)
  4. Remove duplicates
In [8]:
#1.Drop the following columns: unnamed:0, salary and salary_currency as they are redundant information. 
#We have colum salary_in_usd which carries salary infotamation in one currency which will be useful.

df.drop(['Unnamed: 0','salary_currency','salary'], axis="columns", inplace=True)
df
Out[8]:
work_year experience_level employment_type job_title salary_in_usd employee_residence remote_ratio company_location company_size
0 2020 MI FT Data Scientist 79833 DE 0 DE L
1 2020 SE FT Machine Learning Scientist 260000 JP 0 JP S
2 2020 SE FT Big Data Engineer 109024 GB 50 GB M
3 2020 MI FT Product Data Analyst 20000 HN 0 HN S
4 2020 SE FT Machine Learning Engineer 150000 US 50 US L
... ... ... ... ... ... ... ... ... ...
602 2022 SE FT Data Engineer 154000 US 100 US M
603 2022 SE FT Data Engineer 126000 US 100 US M
604 2022 SE FT Data Analyst 129000 US 0 US M
605 2022 SE FT Data Analyst 150000 US 100 US M
606 2022 MI FT AI Scientist 200000 IN 100 US L

607 rows × 9 columns

In [9]:
#2.Check if any missing values - No missing values at all!

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 607 entries, 0 to 606
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   work_year           607 non-null    int64 
 1   experience_level    607 non-null    object
 2   employment_type     607 non-null    object
 3   job_title           607 non-null    object
 4   salary_in_usd       607 non-null    int64 
 5   employee_residence  607 non-null    object
 6   remote_ratio        607 non-null    int64 
 7   company_location    607 non-null    object
 8   company_size        607 non-null    object
dtypes: int64(3), object(6)
memory usage: 42.8+ KB
In [10]:
#3.Check outliers using boxplot
#There is one observation of 600,000USD package for a full-time Principle Data Engineer
#This datapoint is far from others but it is possible scenario on the market in 2021
#Hence we will keep this outlier

df.boxplot(column=["salary_in_usd"])
Out[10]:
<AxesSubplot:>
In [11]:
#4.Check and remove duplicate
#There are no duplicates - we still have 607 observations

df['salary_in_usd'].drop_duplicates().sort_values(ascending=True)
df.drop_duplicates()
Out[11]:
work_year experience_level employment_type job_title salary_in_usd employee_residence remote_ratio company_location company_size
0 2020 MI FT Data Scientist 79833 DE 0 DE L
1 2020 SE FT Machine Learning Scientist 260000 JP 0 JP S
2 2020 SE FT Big Data Engineer 109024 GB 50 GB M
3 2020 MI FT Product Data Analyst 20000 HN 0 HN S
4 2020 SE FT Machine Learning Engineer 150000 US 50 US L
... ... ... ... ... ... ... ... ... ...
602 2022 SE FT Data Engineer 154000 US 100 US M
603 2022 SE FT Data Engineer 126000 US 100 US M
604 2022 SE FT Data Analyst 129000 US 0 US M
605 2022 SE FT Data Analyst 150000 US 100 US M
606 2022 MI FT AI Scientist 200000 IN 100 US L

565 rows × 9 columns

Data visualisations¶

In [12]:
#Visual 1: which roles have been most popular in the field?

df1=df['job_title'].value_counts().head(10)
df1
Out[12]:
Data Scientist                143
Data Engineer                 132
Data Analyst                   97
Machine Learning Engineer      41
Research Scientist             16
Data Science Manager           12
Data Architect                 11
Big Data Engineer               8
Machine Learning Scientist      8
Principal Data Scientist        7
Name: job_title, dtype: int64
In [13]:
fig=px.bar(df1,x=df1.index,y=df1.values, text=df1.values,labels={'index':'Job Title','y':'Count'},title='<b>Top 10 popular roles in Data Science')
fig.show()
In [14]:
#Insight: Data Scientist is the most popular role in last three years. If anyone targets this role, then the competition would be less since it is highly likely to have more job listings.
#Data Analyst, which is the role that I am targeting, is also in demand, which is a good sign.
In [15]:
#Visual 2: Which roles attract the highest compensation package?

df2=df.sort_values(by='salary_in_usd',ascending=False).head(10)
df2
Out[15]:
work_year experience_level employment_type job_title salary_in_usd employee_residence remote_ratio company_location company_size
252 2021 EX FT Principal Data Engineer 600000 US 100 US L
97 2021 MI FT Financial Data Analyst 450000 US 100 US L
33 2020 MI FT Research Scientist 450000 US 0 US M
157 2021 MI FT Applied Machine Learning Scientist 423000 US 50 US L
225 2021 EX CT Principal Data Scientist 416000 US 100 US S
63 2020 SE FT Data Scientist 412000 US 100 US L
523 2022 SE FT Data Analytics Lead 405000 US 100 US L
519 2022 SE FT Applied Data Scientist 380000 US 100 US L
25 2020 EX FT Director of Data Science 325000 US 100 US L
482 2022 EX FT Data Engineer 324000 US 100 US M
In [16]:
fig=px.bar(df2,x='job_title',y='salary_in_usd',color='job_title',title='<b>Top 10 highest paid roles in Data Science')

fig.show()
In [17]:
#Visual 3: Where on earth has had the highest demand for Data Science people?

df3=df['company_location'].value_counts().head(10)
df3
Out[17]:
US    355
GB     47
CA     30
DE     28
IN     24
FR     15
ES     14
GR     11
JP      6
NL      4
Name: company_location, dtype: int64
In [18]:
fig=px.bar(df3,x=df3.index,y=df3.values,text=df3.values, labels={'index':"Country",'y':'Count'},title='<b>Top 10 countries that have more jobs in Data Science')
fig.show()
In [19]:
#Insight: The US is certainly a tech hub where it created the far highest number of data science jobs, compared to others.
#Relocation and remote working for US companies would be ideal for those who are looking for jobs. It does not always have to base in one location.
In [20]:
#Visual 4: Is the trend of remote working in Data Science a hot topic?

df4=df['remote_ratio'].value_counts().head(10)
df4
Out[20]:
100    381
0      127
50      99
Name: remote_ratio, dtype: int64
In [21]:
fig=px.bar(df4,x=df4.index,y=df4.values,labels={'index':'Remote Work Ratio','y':'Count'},title='<b>The trend of working from home in Data Science')

fig.show()
In [22]:
#Insights: So this bar chart confirms that remote working option has been the fashion to attract talents, given the number of jobs providing this flexibility
In [23]:
#Visual 5: Did we have lots of jobs for Entry Level people? New graduates badly need this information.

fig=px.pie(df.groupby('experience_level',as_index=False)['salary_in_usd'].count()
        ,names='experience_level',values='salary_in_usd',color='experience_level'
       ,color_discrete_sequence=px.colors.sequential.RdBu
        ,title='<b>Percentage of jobs based on experience level')

fig.show()
In [26]:
#Insights: We did not have a high percentage of jobs for Entry Level or Junior people, compared to Mid-level or Senior.
#This is a bad news, however, it is good to keep in mind.
#People who is new to data should always upgrade their kills to climb up, to reduce the competition.
In [24]:
#Visual 6: A closer look at median, mean, min, max salary for each level.

px.box(df,x='experience_level',y='salary_in_usd'
       ,color='experience_level'
       ,template='ggplot2'
       ,labels={'experience_level':'Experience Level','salary_in_usd':'salary in usd'}
       ,title='<b>Data Science salaries by experince level')
In [27]:
#Visual 7: So what are the employment types? Which one is more popular?

fig=px.pie(df.groupby('employment_type',as_index=False)['salary_in_usd'].count()
        ,names='employment_type',values='salary_in_usd',color='employment_type'
          ,hole=0.5
           ,title='<b>Percentage of jobs based on employment type')
fig.show()
In [28]:
#Insights: surprisingly, full-time based employees are seeked the most.It seems companies who employed would prefer to build a team that is working full-time hours rather than part-time hours or contract-based.
In [29]:
#Visual 8: How is the size of employing companies? How big are they to afford to have data-related employees on board?

fig=px.pie(df.groupby('company_size',as_index=False)['salary_in_usd'].count()
           ,names='company_size',values='salary_in_usd'
           ,hole=0.7
           ,title='<b> Percentage of companies by size')
fig.show()
In [30]:
#Insights: So, to create a team of data analytics, the company tend to be from medium to large rather than in small scale.
In [37]:
#Visual 9: more statistics about salary by the years

px.violin(df,x='work_year',y='salary_in_usd',color='work_year',labels={'work_year':'year','salary_in_usd':'salary in usd'},template='seaborn',title='<b>Data Science Salaries by year')
In [32]:
#Visual 10: Now, how has the average salary changed over years, expecially when pandemic hit?

x=['2020','2021','2022']

fig=px.line(df,x,y=df.groupby('work_year')['salary_in_usd'].mean()
            ,labels={'x':'Year','y':"Average Salary"}
            ,title="<b>Average salary of Data Science from 2020 to 2022")
fig.show()
In [35]:
#Insight: wow, there was a significant increment for average salary from 2021 to 2022, from around $100k to $122k per year.
#This is a huge and promising for data-nerds. 
#Reference from https://www.coursera.org/articles/data-scientist-salary, salary is expected to be $125k in 2023 in the US

3.Conclusion¶

So to conclude this analysis, here are the takeaways or insights extracted from the dataset:

1.US is the country that has had the highest number of jobs.

2.Full-time employees could maximize their earnings more than Part-time or Contract-based.

3.Data Scientist is the most seeked role in Data Field in the past three years.

4.Remote working flexibility has continued to be the fashion.

5.There is a huge increment in average salary from 2020 to 2022, which is expected to rise up to $125k in 2023 according to Courera.

I hope these facts are helpful for anyone who is seeking for first job in 2023 or struggling to break in the field, or feeling demotivated by how much effort to put in to be employed.

Thank you!

In [ ]: